Host factors are associated with vaginal microbiome structure in pregnancy in the ECHO Cohort Consortium

Using pooled vaginal microbiota data from pregnancy cohorts (N = 683 participants) in the Environmental influences on Child Health Outcomes (ECHO) Program, we analyzed 16S rRNA gene amplicon sequences to identify clinical and demographic host factors that associate with vaginal microbiota structure in pregnancy both within and across diverse cohorts. Using PERMANOVA models, we assessed factors associated with vaginal community structure in pregnancy, examined whether host factors were conserved across populations, and tested the independent and combined effects of host factors on vaginal community state types (CSTs) using multinomial logistic regression models. Demographic and social factors explained a larger amount of variation in the vaginal microbiome in pregnancy than clinical factors. After adjustment, lower education, rather than self-identified race, remained a robust predictor of L. iners dominant (CST III) and diverse (CST IV) (OR = 8.44, 95% CI = 4.06–17.6 and OR = 4.18, 95% CI = 1.88–9.26, respectively). In random forest models, we identified specific taxonomic features of host factors, particularly urogenital pathogens associated with pregnancy complications (Aerococcus christensenii and Gardnerella spp.) among other facultative anaerobes and key markers of community instability (L. iners). Sociodemographic factors were robustly associated with vaginal microbiota structure in pregnancy and should be considered as sources of variation in human microbiome studies.


16S rRNA gene amplicon data
After removing one low-quality sample, the total number of reads per sample ranged from 2629 to 406,377, with a mean of 63,704 (SD = 63,970).Reads from 683 samples were denoised into amplicon sequence variants (ASVs), from which a total of 5232 phylotypes were constructed after mapping to a common phylogenetic tree constructed from full-length vaginal 16S rRNA encoding alleles.As evident in ordination plots shown in Fig. 1A, B, phylogenetic placement of the sequences (Fig. 1B) removed much of the variation that was evident across sites prior to processing in MALiAmPi (Fig. 1A), although some of the remaining site variation may also be due to inherent differences in host factors across the cohorts.

The prevalence of CSTs varied by cohort
Vaginal phylotypes were classified using the VAginaL community state typE Nearest CentroId clAssifier (VALEN-CIA) 11    While almost all CSTs were found in each cohort, the prevalence of each CST significantly varied by cohort (Fig. 2).For the Atlanta cohort, the most common was CST III (47.6%), followed by CST IV-B (34.9%), whereas in the MARCH cohort, the most prevalent was CST I (38.2%), followed by CST III (34.1%).Although CST III and CST I were also the most common in WISC, the distribution was distinct from that in the MARCH cohort (Fig. 2).In contrast, non-L.iners Lactobacillus-dominant CSTs were less prevalent in the Atlanta cohort (Fig. 2).

Host factors associated with vaginal microbiota structure across cohorts
In single-factor, unadjusted permutational multivariate analysis of variance (PERMANOVA) models, both education and self-reported race accounted for the largest variance explained (4.28% and 4.26%, respectively; false discovery rate (FDR)-adjusted p-value < 0.001) in vaginal microbiota structure in pregnancy, followed by public insurance receipt (3% variance, FDR-adjusted p-value < 0.001) (Fig. 3).Antibiotics in pregnancy and age significantly contributed to 2% of the variance in vaginal microbiota structure (FDR-adjusted p-value < 0.01, each), while parity and smoking in pregnancy explained smaller amounts (Fig. 3).Although still significant after adjustment for multiple comparisons, early pregnancy BMI and hypertension each accounted for less than 1% of the variance in community structure.Self-reported Hispanic ethnicity, gestational diabetes, and sex of the infant were not associated with global community structure in pregnancy (Fig. 3).

Host factors independently associated with vaginal microbiota structure
We next used multifactor PERMANOVA models to assess the independent effects of host factors on vaginal communities.In aggregate, host factors accounted for nearly 12% of the overall variance in vaginal microbiota structure in pregnancy.Education and self-reported race remained the most robust host factors associated with vaginal microbiota structure in pregnancy (4% of variance for each, FDR-adjusted p < 0.01) (Fig. 3).Parity, antibiotic use in pregnancy, and age remained independent predictors of vaginal microbiome structure in pregnancy but slightly less so after accounting for other host factors (FDR-adjusted p-values 0.01 to < 0.05, respectively).The effects of early pregnancy BMI, hypertension, and public insurance receipt, in contrast, became attenuated after adjustment (Fig. 3).

Host factors associated with vaginal microbiota structure were largely conserved across cohorts
We next generated cohort-specific PERMANOVA models to visualize results for each cohort independently (Fig. 4).The results were largely consistent, especially in the MARCH and Atlanta cohorts in which data on the  www.nature.com/scientificreports/host factors were well-aligned, further validating their independent effects on microbiome variation.Specifically, education and parity exhibited consistently robust associations with the vaginal microbiota community structure within each cohort (Fig. 4).Of note, self-reported race was significant in the pooled PERMANOVA model but not in the cohort-specific models, likely due to the substantial differences in its distribution between cohorts (Fig. 4 and Table 2).Some of the associations with host factors in the other cohorts were diminished in the WISC cohort, and others, such as antibiotic use in pregnancy, were not available.

Taxonomic differences associated with robust host factors in pregnancy
We also examined taxonomic differences associated with host factors in pregnancy that may drive variation in vaginal microbiota structure.For each host factor that remained robust in the fully adjusted models of global community variance, we used random forest models to rank the most predictive phylotypes, which were classified to the species level.Of the taxa most predictive of educational attainment, L. iners was the most discriminant, followed by A. christensenii, Streptococcus oralis, and G. vaginalis (Fig. 5).The taxa most predictive of self-identified race were S. oralis followed by Lactobacillus gallinarum (Fig. 5).Fannyhessea vaginae (previously Atopobium vaginae) was the most predictive of antibiotic use in pregnancy followed by A. christensenii, L. iners, and G. vaginalis (Fig. 5).Staphylococcus epidermidis, Parvimonas micra, and L. iners were among the most predictive of parity.Staphylococcus epidermidis, P. micra, and L. iners were among the most predictive of parity, while Dialister micraerophilus was most predictive of host age (Fig. 5).Receiver operating curves (ROCs) for the top 20 taxa from the random forest models demonstrated areas under the curve (AUCs) ranging from 0.973 for self-identified race to 0.6114 for parity (Supplementary Fig. 1).www.nature.com/scientificreports/

Vaginal CSTs were associated with host factors
Due to their clinical relevance, we also tested the independent and joint associations between host factors and vaginal CSTs using a series of nested multinomial logistic regression models.CSTs were collapsed into three categories: L. iners-dominant (CST III), diverse (CST IV-B and IV-C) and non-L.iners Lactobacillus-dominant (i.e.CST I, II, V) CSTs, with the latter serving as the reference category.Prior to multivariable adjustment, we assessed the associations between host factors and CSTs for model selection (Supplementary Table 1).In the unadjusted model, self-identified White race was associated with a reduced odds of L. iners dominant (OR = 0.28, 95% CI = 0.19-0.42)and diverse CSTs (OR = 0.21, 95% CI = 0.14-0.32)compared to that of non-L.iners Lactobacillus-dominant CSTs (Table 3, Model 1).However, after adjustment for maternal education, the effect of race became attenuated and was no longer significant (Table3, Model 3

Discussion
Our results identified new associations between microbiota structure and host factors overall as well as with specific taxonomic signatures.We also verified some associations, such as host self-identified race and age, that had been previously described and further tested their independent effects after adjustment for other factors.
The most robust factors associated with vaginal microbiota structure in pregnancy were education, parity, and self-identified race, followed by prenatal antibiotic use.Together, education, age, self-identified race, antibiotic use, and parity explained a moderate amount of variance in prenatal vaginal community structure.Our results corroborate single-site studies linking vaginal community patterns and maternal education and age and parity 23,[29][30][31][32] .They are also consistent with prior smaller studies on host socio-demographics and the structure of other microbial niches as well as those among non-pregnant populations. 33hile self-identified race remained significantly associated with global microbiome structure, which was consistent with prior research, 14,25 , it was no longer significantly associated with vaginal community states in fully adjusted models.Race may reflect a range of unmeasured exposures, including maternal stress exposures related to discrimination and structural racism.In a recent paper, combined effects of individual and neighborhoodlevel measures of socioeconomic status were associated with vaginal microbiome composition 30 .Similar to our results, social host factors (i.e., education and self-identified race) were more closely related to the pregnancy microbiome than clinical factors (i.e., gestational diabetes, hypertension, early-pregnancy BMI, and antibiotics during pregnancy).These results suggest that exposures that occur over a longer period of time, compared to the relatively short period of gestation, have greater effects on the microbiota in pregnancy.Alternatively, social factors may be more influential because they reflect the physiological effects of multiple interacting host factors and unmeasured environmental factors, including urbanicity 34 , pollution, racial segregation, diet, and chronic stress 35,36 .Host factors associated with vaginal microbiota structure were also largely conserved across the MARCH and Atlanta cohorts, which are demographically and racially diverse.
Taken together, our results suggest that it is important to account for host factors in vaginal microbiota studies, as they may drive specific facets of community structure.The taxa most predictive of level of education, L. iners, is clinically relevant as a marker of instability of vaginal microbiome structure as well as bacterial vaginosis and pregnancy complications [37][38][39][40] .The other most discriminant taxa inversely associated with lower education were pathogens previously associated with pregnancy complications: Aerococcus christensenii, S. oralis, and G. vaginalis [41][42][43][44] .Taxa most discriminant of host factors across all cohorts were consistent with polymicrobial vaginal communities and bacterial vaginosis in contrast with the hallmarks of highly stable L. crispatus-dominant communities 15,45,46 , which confer pathogen resistance through the production of lactic acid and hydrogen peroxide by lowering vaginal pH and inflammation and which are critical for pregnancy maintenance.Our results also indicate that when phylogenetic distance is used to cluster and taxonomically classify 16S rRNA gene data, the effect of host factors on the vaginal community structure are remarkably consistent across populations.Furthermore, the distribution of CSTs varied by cohort, and such variation could be explained by the differential distribution of host factors, which should be considered in the design and analysis of future studies.
Strengths of the study include the large size and inclusion of well-characterized ECHO cohort metadata.We attempted to removed site-specific biases by using a common bioinformatic pipeline that included phylogenetic mapping and well-curated full-length 16S rRNA gene vaginal references.While we acknowledge that some technical sources of variation may have remained using this approach to harmonizing different cohorts' 16S rRNA gene sequence data, the magnitude of variation across sites appeared to be less than it was prior to phylogenetic scaffolding.Limitations include some differences in sample collection protocols i.e., WISC and Microbes, Allergy, Asthma, and Pets (MAAP) samples, collected at group B streptococcus screenings in the third trimester, were drawn from recto-vaginal samples, which may have explained some of the differences between the host factors that remained consistent between the MARCH and Atlanta cohorts but were more attenuated in the WISC and MAAP cohort.There were also site differences in sequence length and primer kits, which may have resulted in some artifacts in the taxa we identified with host factors.In addition, limited data availability for some factors that likely affect vaginal microbiome structure precluded comparisons, such as prenatal antibiotic use in the WISC, as well as measures of diet, cohabitation/marital status, and douching practices 23 .Since data were sequenced prior to this study, we were not able to harmonize and account for the effect of gestational week at sample collection.Therefore, we aligned the sequence data to the first collection in pregnancy, although this www.nature.com/scientificreports/varied from the first to the third trimester, which also may have explained some of the different results observed in the WISC and MAAP sites.
We disentangled host factors that may drive differences in microbial signatures across populations when their underlying distributions are differential.Our results suggest that host factors are plausible explanations for some of the inconsistent results in previous smaller studies, most of which, to date have failed to account for host factors.As such, our results have important implications for the design and analysis of future populationbased studies of the vaginal microbiome in pregnancy and underscore the need to fully account for the complex relationships between host factors and the microbiome.

Study population
The ECHO cohort is a consortium of birth cohorts from across the United States designed to evaluate the impact of in utero and early-life exposures on child health outcomes and includes detailed survey, medical record, and bio-specimen collections.The design and purpose of the ECHO cohort have been previously described 27,28 .All ECHO cohorts with available vaginal samples that had previously undergone 16S rRNA gene amplicon sequencing were included in the present study (Table 1), which meta-analyzed 16S rRNA gene amplicon data where available (Supplementary Materials, Fig. 2) from ECHO across a diverse set of sites and populations.The Atlanta cohort participants were recruited between 8 and 14 weeks gestations from prenatal clinics affiliated with two hospitals in Atlanta, GA.The MAAP cohort recruited pregnant women during their second and third trimesters from two hospital systems in metro Detroit, MI, to understand how exposures in early life modify risk for asthma 47 .The Children's Respiratory and Environmental Workgroup (CREW) includes both the WISC, drawn from rural medical centers in north-central Wisconsin 48 , and the MAAP cohort, drawn from urban metro-Detroit sites.The MARCH) a population-based cohort that recruited from initial prenatal appointments; for this study, we used a subset of the entire MARCH cohort that included both University of Michigan sites (MARCH U-M) and remote collection of samples from nine other sites across the state of Michigan.For this study, only those participants who provided informed consent for providing at least one vaginal swab sample during their pregnancy were included, and we analyzed the first sample collected in pregnancy.

Clinical and demographic measures
At all sites, demographic data and health-related practices were collected from prenatal surveys, and health conditions, infection, delivery information, and antibiotic use were abstracted from the medical record.In MARCH, detailed information, including the infant's sex, infant's birth weight, complications of pregnancy, and pre-pregnancy BMI and gestational age, was derived from the birth certificate.Host factor metadata were harmonized to include self-reported White versus non-White race, any antibiotics in pregnancy, and public insurance as a proxy for socioeconomic status since household income was collected differently across sites.

Vaginal sample collection
At the MARCH U-M sites, vaginal dual-headed dry swabs (Starplex™ Scientific S09D, Fisher Scientific) were self-collected in clinics.Immediately upon collection, AllProtect (Qiagen) preservative was added to the swabs, and they were stored for 24-48 h at −20 °C before being transported on ice for long-term storage at -80 °C.In the other MARCH sites, vaginal dual-headed dry swabs were self-collected at home, mailed to the laboratory, and archived at −80 °C.Similarly, in the Atlanta cohort, vaginal swabs were self-collected using the Sterile Catch-All Sample Swab (Epicentre), placed immediately in MoBio bead tubes, and transported on ice for archival storage at −80 °C.For both the MARCH and Atlanta cohorts, the first vaginal swab collected in pregnancy was analyzed.At the WISC and MAAP sites, vaginal/rectal swabs (Epicentre Catch-All) were collected by a provider within 6 weeks of delivery at the time of group B streptococcus screening.Swabs were stored in RNAlater (Thermo Fisher Scientific) at 4 °C for at least 24 h and then transferred to −80 °C.Prior to DNA extraction, swabs were thawed on ice and transferred to Lysing Matrix E (LME) tubes.RNAlater was transferred into a sterile tube and centrifuged at 16,000×g for 5 min at 4 °C.Pellets were re-suspended using cetyltrimethylammonium bromide buffer and transferred to the LME tube containing the swab 47 .

DNA extraction, library preparation, and sequencing
MARCH DNA was extracted from the MARCH U-M samples using the PowerMag kit (Qiagen; MoBio Laboratories) optimized for the epMotion 5075 TMX (Eppendorf).DNA samples were quantified using the Quant-iT Pico-Green dsDNA Assay kit (Thermo Fisher Scientific).The V4 region of the 16S rRNA gene was amplified using the dual-index sequencing strategy outlined in the MiSeq SOP 49 at the MARCH U-M Microbiome Core 50 .Amplicons were sequenced using 250 bp Illumina MiSeq (MiSeq Reagent 222 kit V2) for 500 cycles according to the manufacturer's instructions with modifications for the primer set.Libraries and sequencing reagents with custom read 1/read 2 and index primers added were prepared according to Illumina's protocol for 2 nM libraries.The final load concentration was 4 pM, spiked with 15% PhiX.
DNA was extracted from the other MARCH samples also using the DNEasy Powersoil DNA Isolation kit (Qiagen) per the Human Microbiome Project's protocol 51 and MiSeq SOP 49 .Polymerase chain reaction (PCR) amplification was performed on the V4 region of the 16S rRNA gene following the mothur wet lab documentation 52 , using primer sets SB501-SB508 and SA701-SA712 ordered from IDT. Successfully amplified triplicate PCR reactions were pooled and purified using Agencourt AMPure XP (Beckman Coulter), and the concentration of 16S rRNA gene amplicons was quantified using the Quant-IT dsDNA assay kit (Invitrogen).

Atlanta
Samples underwent amplification of the V3-V4 region of the 16S rRNA gene following a two-step PCR protocol 53 .Amplicons were sequenced on the Illumina HiSeq 2500 modified to generate 300 bp paired-end reads.Additional details have been published elsewhere 8 .

WISC and MAAP
The V4 region of the 16S rRNA gene was amplified in triplicate reactions per sample using 515F/806R primers and PCR conditions previously described 12,54 .Pooled amplicon reactions with 5 ng were purified using the SequalPrep Normalization Plate Kit according to the manufacturer's specifications, quantified using the Qubit dsDNA HS Assay Kit (Thermo Fisher Scientific), and pooled at equimolar concentrations.The amplicon library was constructed using the Agencourt AMPure XP system (Beckman-Coulter), quantified using the KAPA Library Quantification Kit (APA Biosystems), and diluted to 2 nM.Equimolar PhiX was added at 40% final volume to the amplicon library followed by sequencing on the Illumina NextSeq 500 platform employing a 2 × 150 bp sequencing run.Additional details have been published elsewhere for WISC 38 and MAAP 32 .

Bioinformatics
De-duplication was performed using the dual bar-code approach.We processed raw sequences from all sites together using the DADA2 Workflow for Big Data v.1.5.2 in order to cluster them into ASVs (https:// benjj neb.github.io/ dada2/ bigda ta.html) 55 .For the Atlanta and MARCH cohorts, forward and reverse reads were trimmed using lengths of 255 and 225 bp, respectively, and filtered using a minimum quality score of 2. For the WISC and MAAP cohort, reads were maintained if they exhibited a maximum expected error of 2 and a read length of at least 150 bp.We then processed the ASVs using MaLiAmPi 56,57 , a computational tool designed to robustly combine 16S amplicon data for meta-analysis using phylogenetic placement.Sequences are mapped to a common tree, which we constructed from full-length 16S rRNA allele data from NCBI (cached as a repository on Zenodo 58 ).We employed a minimum overlap at six for read-joining and removed Chimeras following the dada2 protocol.Given that samples had previously undergone isolation and sequencing at each site (Table 1), we harmonized 16S rRNA gene sequences using MaLiAmPi 56,57 .We used the "refpkg.nf" module in MaLiAmPi to recruit alleles from this repository and assemble them de novo via RAxMLv8 59 .Finally, the amplicon sequence variants from DADA2 55 were placed onto this phylogenetic tree via pplacer 60 , and metrics including alpha-diversity, pairwise phylogenetic distance, and taxonomic composition were derived via the pplacer utility guppy, per the pplacer_place_classify.nf module of MaLiAmPi.

Statistical analysis
Demographic and clinical characteristics across cohorts are summarized in Table 2. To test for significant differences in the distribution of participant characteristics, we used chi-square or t-tests as appropriate.To identify host factors associated with vaginal microbiota structure in pregnancy, we generated single-factor PERMANOVA models based on Bray-Curtis distances overall and separately for each cohort using adonis2 implementation in R's vegan package and dispersion using betsdispr based on 100,000 permutations.
We also used multifactor PERMANOVA models to examine the independent effects of host factors using a backward variable selection approach.We adjusted p-values for multiple comparisons using a Benjamini and Hochberg FDR criterion of p < 0.05 66 .To test the independent effects of host factors on vaginal CSTs, we constructed a series of nested multinomial logistic regression models from the most robust predictors of vaginal microbiome variation identified from the PERMANOVA results.In these models, we collapsed the six VALEN-CIA classifications into three categories: non-L.iners Lactobacillus dominant (CSTs I, II, and V [reference]), L. iners dominant (CST III), and diverse (CST IV).Covariates were selected using a criterion of p < 0.2 in the PERMANOVA models including self-identified race, education level, maternal age, receipt of antibiotics in pregnancy, and parity category.Using a stepwise forward selection approach beginning with self-identified race, we compared parameter estimates as host factors were added to subsequent models to test whether the effect of self-identified race became attenuated after adjustment for each host factor or remained independently associated with vaginal CSTs.Models were run for the pooled cohort data as well as for each cohort individually using the R multinom() function from the nnet package.
We next examined how the factors retained in the multiple adjusted models associated with the relative abundance of specific taxa within communities.Using a machine learning approach utilizing a random forest classifier for each robust host factor in the PERMANOVA models, we ranked specific taxa that contributed the largest amount of homogeneity in the nodes and leaves of the forest trees by estimating the mean decrease in Gini index and increase in node purity coefficients for categorical and continuous predictors, respectively.Gini coefficients are estimated each time the tree is split on each feature, with higher values ranking greater discrimination.Random forest classifiers were run using a test and train validation set approach, splitting the data to set aside 20% for testing (80/20 split).We generated supervised random forest model plots and validated them Self-collected vaginal Starplex star dual-headed swabs at 3 times across gestation beginning at 7 to > 28 weeks gestation Clinic sample preserved in All-Protect upon collection 16S rRNA gene (V4 region) Qiagen PowerMicrobiome DNA/RNA EP kit MARCH (non-U-M sites) Recruited from 9 other non-UM sites across Michigan Self-collected vaginal Starplex star dual-headed swabs and fecal (no preservative) > 28 weeks gestation Mail collection 16S rRNA gene (V4 region) Qiagen/Mobio power soil DNA extraction kit Atlanta Recruited from prenatal public/private clinics in Atlanta Vaginal (self-collected, mid-vaginal), along with oral and rectal Epicentre Catchall swabs, collected at 8-14 and 24-30 weeks gestation 16S rRNA gene (V3-V4 region) Qiagen/Mobio power soil DNA extraction kit CREW (WISC and MAAP) WISC recruited farm and rural-nonfarm families in Wisconsin.MAAP recruited families in Detroit Vaginal-rectal swabs sampled from clinic at > 28 weeks gestation 16S rRNAgene (V4 region) SequalPrep Normalization Plate Kit (Thermo Fisher Scientific)

Figure 1 .
Figure 1.(a) Principal coordinates (PCoA) analysis of Bray-Curtis distances between samples based on amplicon sequence variants (ASVs) and (b) phylotypes demonstrating that using phylogenetic placement of ASVs on a reference tree removed a large degree of variation by site.

Figure 4 .
Figure 4. Associations between vaginal microbiota community structure and host factors across cohorts.Single-adjusted (left) and multifactor-adjusted (left) PERMANOVA estimates of vaginal microbiota structure in pregnancy using Bray-Curtis distances by individual cohort.Only factors that were significantly associated with composition (p < 0.05) in single-factor models were retained for inclusion in the multifactor model using backward selection.

Table 2 .
Characteristics . Four CSTs were Lactobacillus dominant: CST I, L. crispatus; CST II, L. gasseri; CST III, L. iners; and CST V, L. jenseni.The remainder were diverse polymicrobial communities: CST IV-B, characterized by high Gardnerella by ECHO cohort.Data shown (except maternal age) are n (%).BA or higher college degree or higher, HS/GED high school or general equivalency degree, BMI body mass index.*Self-identified as a proxy for lived experiences.Fisher's exact test; Kruskal-Wallis rank sum test; Pearson's chi-squared test.

Decrease Gini Parity Category Figure 5.
Random forest panel plots of the top ranked vaginal taxonomic features predictive of host factors.

Table 3 .
Host factors are independently associated with vaginal community state types in pregnancy.OR odds ratio, CI confidence interval, BA or higher college degree or higher, HS/GED high school or general equivalency degree.Multinomial logistic regression models.Reference is non-L.inersLactobacillusdominant (I, II, V) community state type.Model1 is adjusted for selfidentified White race.Model 2 is adjusted for self-identified White race, host age.Model 3 is adjusted for self-identified White race, host age, educational attainment.Model 4 is adjusted for self-identified White race, host age, educational attainment, parity.Model 5 is adjusted for self-identified White race, host age, educational attainment, parity, and antibiotics use in pregnancy.